Koltchinskii is to be congratulated for developing a unified framework. 
This elegant framework is general and allows a user to apply it directly 
instead of deriving bounds in each risk minimization problem. In the past 
decade, the problem of risk minimization has been extensively studied in 
function estimation and classification. In function estimation, it has been 
investigated using the empirical process technique under the name of min- 
imum contrast or sieve estimation in, for instance, [2, 3, 7, 12, 13, 15]. In 
classification, it has been studied in a similar fashion; cf. [1, 6, 10, 11]. The 
general framework derived in this article yields an upper bound of the excess 
risk through local Rademacher complexities. When applying such a framework
to a specific problem, however, one must attend to the structure of that
problem, which may matter greatly.

Our discussion centers on two aspects: (1) the role that the variance
and mean play, particularly in classification, and (2) the practicability of
an empirical complexity.

1. The role of variance and mean. 

1.1. Variance-mean relationship and the margin condition. As noted in
the paper, one key idea to recover the optimal rate of convergence is to
bound the local complexity
$E \sup_{f \in \mathcal{F},\, P(f - \bar f) \le \delta} |(P_n - P)(f - \bar f)|$ instead of
the global one. This is achieved by bounding $\mathrm{Var}(f - \bar f)$, or sufficiently the
second moment $P(f - \bar f)^2$, by the mean $P(f - \bar f)$. Such a variance-mean
relationship was essentially used in [7] in a slightly more general form of

(1) $\mathrm{Var}(f(X) - \bar f(X)) \le a[E(f(X) - \bar f(X))]^{2\beta}$

for some constants $a > 0$ and $\beta > 0$, based on which an iterative improvement
approach is employed to derive fast rates of convergence by exploring
$\sup_{A} |(P_n - P)(f - \bar f)|$ over a local set $A$. This is analogous to the fixed
point approach used in the present paper.

In what follows, we argue that (1) is more fundamental than the pop- 
ular margin (low noise) assumption (cf. [10]) commonly used in classifica- 
tion, resulting in fast rates of convergence. In classification, (1) summa- 
rizes not only the local behavior of the optimal decision function near its 
classification boundary but also the global behavior of the optimal deci- 
sion function, whereas the margin assumption describes only the former. 
To be more specific, consider Tsybakov's classification example as discussed 
in Section 6.1 of the present paper. Following the notation of [10], denote
by $Z$ the input/output pair $(X, Y)$ with $X \in R^d$ and $Y \in \{0, 1\}$.
Let $\mathcal{G} = \{G \subset R^d\}$ be a class of classification sets. Now define $f = f_G$
to be $I(Y \ne I(X \in G))$, which is the 0-1 classification error loss for a set
$G \in \mathcal{G}$ given $Z$. It is easily seen that the margin assumption (A1)
in [10] implies (1) with $\beta = 1/(2\kappa)$, and assumption (A2) there implies that the
$L_2(P)$ bracketing entropy of $\mathcal{F} = \{f_G : G \in \mathcal{G}\}$ is of order $\varepsilon^{-2\rho}$ with
$0 < \rho < 1$. Then an application of Theorem 2 of Shen and Wong [7] with
$\alpha = 1/2$, $\beta = 1/(2\kappa)$ and $r = 2\rho$ yields a rate of convergence of the excess risk
$\rho(f_{\hat G}, f_{G^*})$, or the Bayesian regret, of

$n^{-1/(4\alpha - \min(\alpha,\beta)(2-r))} = n^{-\kappa/(2\kappa+\rho-1)}$,

which agrees with the result of Tsybakov [10] and the present paper. Here
$\rho(f_G, f_{G^*}) = \rho(f, \bar f) = E(f_G(Z) - f_{G^*}(Z))$ with $\bar f = f_{G^*}(Z)$ defined by the
optimal classification set $G^*$. This example indicates that (1) is actually
weaker than the margin assumption (A1). Furthermore, (1) continues to
be a key assumption for risk minimization in classification even when (A1)
breaks down, as in linear SVM with the hinge loss. This is because in this
case $\bar f$ no longer approximates the Bayes rule in the sense of [10].
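
The two steps of this example can be written out as a check. The sketch below assumes that (A1) lower-bounds the excess risk by $c_0 [P_X(G \triangle G^*)]^\kappa$ with $\kappa \ge 1$, at least for sets $G$ close to $G^*$; the symbols $c_0$ and $P_X$ are labels introduced here for illustration and do not appear in the text above.

```latex
% Step 1: (A1) implies (1). The difference f_G - f_{G^*} takes values in
% {-1, 0, 1} and is nonzero exactly when X lies in G \triangle G^*.
\begin{align*}
\mathrm{Var}(f_G - f_{G^*})
  &\le E(f_G - f_{G^*})^2 = P_X(G \triangle G^*) \\
  &\le \bigl[ c_0^{-1} E(f_G - f_{G^*}) \bigr]^{1/\kappa},
  \qquad\text{so that } 2\beta = 1/\kappa .
\end{align*}
% Step 2: the exponent. With \kappa \ge 1 we have \min(\alpha,\beta) = \beta.
\begin{align*}
4\alpha - \min(\alpha,\beta)(2 - r)
  &= 2 - \frac{1}{2\kappa}(2 - 2\rho)
   = \frac{2\kappa + \rho - 1}{\kappa}, \\
n^{-1/(4\alpha - \min(\alpha,\beta)(2-r))}
  &= n^{-\kappa/(2\kappa + \rho - 1)} .
\end{align*}
```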

1.2. Variance-mean relationship in margin-based classification. Consider 
an equivalent version of (1) in regression and classification: 

(2) $\mathrm{Var}(l(Y, f(X)) - l(Y, \bar f(X))) \le a[E(l(Y, f(X)) - l(Y, \bar f(X)))]^{2\beta}$,

where $(X, Y)$ is an observation pair, $l$ is a loss function and $f$ is a parameter
in $\mathcal{F}$.

The present paper nicely illustrates the importance of the variance-mean
relationship in least squares regression, in which $\beta = 1/2$ in (2). In
classification, however, the situation is much more complex. As illustrated
in [14], the exponent $\beta$ in (2) may depend on the choice of the loss function
and of $\mathcal{F}$. For simplicity, we assume $Y = \pm 1$ as opposed to 0, 1 in what
follows. For $\psi$-learning [8], where
$l(y, f(x)) = I[y f(x) < 0] + (1 - y f(x)) I[0 \le y f(x) < 1]$, (2) holds with
$\beta = 1/(2\kappa)$, where $\kappa$ is the exponent given in assumption (A1). As a result,
the aforementioned fast rate $n^{-\kappa/(2\kappa+\rho-1)}$ in Section 1.1 can be realized by
$\psi$-learning, provided that the bracketing $L_2$ entropy of the class of candidate
classification sets induced by decision functions is of order $\varepsilon^{-2\rho}$ for
$0 < \rho < 1$. For SVM, $l(y, f(x)) = [1 - y f(x)]_+$ and hence (2) is met with
$\beta = 1/2$ generally for a finite-dimensional linear space $\mathcal{F}$. However, when
$\mathcal{F}$ is sufficiently rich, $\beta = 1/2$ for the separable case but is essentially 0
for the nonseparable case.
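
The two losses compared in this section are easy to state in code, which also gives a way to inspect the two sides of (2) empirically. The following is only a minimal sketch: the function names and the helper `variance_and_mean` are hypothetical, and the sample variance and mean it returns are plug-in quantities, not the population ones in (2).

```python
import numpy as np

def psi_loss(y, fx):
    """psi-loss as written in the text: I[yf < 0] + (1 - yf) I[0 <= yf < 1]."""
    z = np.asarray(y, dtype=float) * np.asarray(fx, dtype=float)
    return (z < 0).astype(float) + (1.0 - z) * ((z >= 0) & (z < 1))

def hinge_loss(y, fx):
    """SVM hinge loss [1 - yf]_+ ."""
    z = np.asarray(y, dtype=float) * np.asarray(fx, dtype=float)
    return np.maximum(0.0, 1.0 - z)

def variance_and_mean(loss, y, fx, fbar_x):
    """Sample variance and sample mean of l(Y, f(X)) - l(Y, fbar(X)).

    Hypothetical helper: fx and fbar_x hold the values of a candidate f and
    of the reference function fbar at the observed inputs.
    """
    diff = loss(y, fx) - loss(y, fbar_x)
    return diff.var(), diff.mean()

# Functional margins yf(x) on a small grid, with y = +1 throughout.
margins = np.array([-0.5, 0.0, 0.5, 1.0, 2.0])
psi_values = psi_loss(1.0, margins)      # bounded by 1 for negative margins
hinge_values = hinge_loss(1.0, margins)  # grows linearly for negative margins
```

The psi-loss stays bounded for badly misclassified points while the hinge loss does not, which is one way to see why the variance of the loss difference can behave differently under the two losses.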

1.3. Variance-mean relationship when the candidate function class $\mathcal{F}$ is
large. Another phenomenon worth mentioning is that (1) may not
be useful for improving the excess risk bound when $\mathcal{F}$ is very large. For
instance, when the metric entropy of $\mathcal{F}$ is of the order $\varepsilon^{-2\rho}$ with $\rho = 1$, a
rate $n^{-1/2} \log n$ can be realized (Theorem 2 of [7]), where $\beta$ in (1) does not
enter into the rate expression. This is in contrast to the fast rate obtained
in Section 1.2. A situation like this occurs in the $L_1$-norm SVM variable
selection with the number of candidate variables $d$ greatly exceeding
the sample size $n$, where $\mathcal{F} = \{f = \theta^T x : \|\theta\|_1 \le s, \theta \in R^d\}$ with a tuning
parameter $s > 0$, and the entropy of $\mathcal{F}$ is of order $(\log d)\varepsilon^{-2}$. This leads to
a rate of $(n^{-1} \log d)^{1/2} \log(n (\log d)^{-1})$ (cf. [14]), as long as $d$ grows no faster
than $\exp(n)$.
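
To get a feel for how these rate expressions compare, they can be evaluated numerically. The sketch below simply codes the three expressions quoted above, ignoring all constants; the function names are hypothetical, and the guard requiring n > log d is an assumption of this sketch that keeps the last expression positive.

```python
import math

def fast_rate(n, kappa, rho):
    """n^{-kappa/(2 kappa + rho - 1)}, the rate quoted in Section 1.1."""
    return n ** (-kappa / (2.0 * kappa + rho - 1.0))

def large_class_rate(n):
    """n^{-1/2} log n, the rate quoted for entropy of order eps^{-2}."""
    return math.log(n) / math.sqrt(n)

def l1_svm_rate(n, d):
    """(log d / n)^{1/2} log(n / log d), the rate quoted from [14]."""
    log_d = math.log(d)
    if n <= log_d:
        # Assumption of this sketch: only evaluate where n exceeds log d.
        raise ValueError("need n > log d")
    return math.sqrt(log_d / n) * math.log(n / log_d)

# d may greatly exceed n: here d = 10^6 candidate variables.
for n in (10**3, 10**4, 10**5):
    print(n, fast_rate(n, 1.0, 0.5), large_class_rate(n), l1_svm_rate(n, 10**6))
```

Since constants are dropped, only the relative decay in n is meaningful here, not the absolute values.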

2. Empirical complexity as a way of model selection. With regard to 
model selection, the author advocates a model selection criterion through 
penalization that mimics an oracle inequality of some type (Section 5), which 
is an upper bound of the excess risk. The selection criterion, or a data- 
dependent upper bound, estimates the oracle inequality. For a model selec- 
tion criterion constructed in this manner, several important issues remain. 
First, an oracle (upper) inequality through a concentration inequality could 
be rough in the sense that the difference between the upper bound and the 
actual excess risk is large, although it dramatically simplifies the process of 
estimating the excess risk. Consequently, the optimal model selected by the 
model selection criterion may be inaccurate due to the bias introduced by 
an imprecise upper bound of the excess risk, particularly in the finite-sample 
situation. This phenomenon has been noted in [5] when AIC and BIC were 
compared against Vapnik's structural risk minimization via a penalty based 
on the VC-dimension, with respect to the accuracy of prediction. Second, it
appears rather difficult to track the constants in the penalties, both
theoretically and empirically. Theoretically, $\pi_n(k)$ and $K$ in the oracle
inequality may be imprecise in that many "numerical constants" are suited
for the purpose. Can these terms then be optimally determined? Empirically,
it seems unnecessary that $\pi_n(k)$ and $K$ be precisely estimated by $\hat\pi(k)$ and
$\hat K$; even an inconsistent estimator can give the desired result. While a rate
of convergence result is useful in providing insight into the problem of model
selection, estimation of an overly simplified oracle inequality may not be
precise enough to determine all required constants; further developments
may be needed.

The foregoing discussion brings up an interesting and important point:
how to balance mathematical tractability and the accuracy of risk estimation.
We now turn our attention to covariance penalties in the framework
of model selection via penalization, which directly estimates the risk/loss 
based on optimal predictive estimation. Covariance penalties that are ap- 
proximately unbiased for estimating the risk are shown to be more precise 
in prediction than their competitor cross-validation in [4]. A general con- 
struction of covariance penalties can be found in [4, 9] in a family of Q-error 
losses including the Kullback-Leibler loss and the 0-1 classification error 
loss. General estimation methods for covariance penalties include bootstrap 
and data perturbation. In contrast, covariance penalties are less tractable 
theoretically than the penalties in an oracle inequality. 
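
As an illustration of the bootstrap route, the following minimal sketch estimates a covariance penalty by a parametric bootstrap. It rests on several assumptions not made in the text: squared error loss, for which the penalty is twice the sum of the covariances between fitted and observed values, Gaussian noise with known standard deviation `sigma`, and an ordinary least squares fit. The names `covariance_penalty` and `make_ols_fit` are hypothetical.

```python
import numpy as np

def covariance_penalty(fit, y, sigma, n_boot=2000, seed=0):
    """Parametric-bootstrap estimate of 2 * sum_i cov(mu_hat_i, y_i).

    fit   : function mapping a response vector to fitted values
    sigma : noise standard deviation, assumed known in this sketch
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    mu_hat = fit(y)
    # Draw bootstrap responses around the fitted values.
    y_star = mu_hat + sigma * rng.standard_normal((n_boot, y.size))
    mu_star = np.array([fit(ys) for ys in y_star])
    # Sample covariance between fitted and observed value, per coordinate.
    y_centered = y_star - y_star.mean(axis=0)
    cov_i = (mu_star * y_centered).sum(axis=0) / (n_boot - 1)
    return 2.0 * cov_i.sum()

def make_ols_fit(X):
    """Least squares fit y -> H y, with H the projection onto the columns of X."""
    H = X @ np.linalg.pinv(X)
    return lambda y: H @ y

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(50)
penalty = covariance_penalty(make_ols_fit(X), y, sigma=1.0)
```

For least squares with three regressors and unit noise variance, the exact value of this penalty is 2 x 3 = 6, so the bootstrap estimate can be checked against it; for less tractable fitting procedures no such closed form is available, which is where resampling becomes useful.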